With the growing use of virtual systems, support ticket systems have come into prominence. Routing issue tickets to the appropriate person or unit in the support team is critical for improving end-user satisfaction and for better allocation of support resources. The assignment of help tickets to the appropriate group is still performed manually. Especially in large organizations, manual assignment does not scale well: it is time consuming, requires human effort, and is prone to human error. Misrouted tickets also lead to ineffective resource consumption.
In this project, machine learning techniques and other algorithms with proven performance in text processing are used to classify tickets into the correct assignment groups.
The goal of the project is to build a classifier that can assign tickets by analysing their text.
The overall objectives of this project are:
Details of the dataset are available at the link below:
https://drive.google.com/file/d/1OZNJm81JXucV3HmZroMq6qCT2m7ez7IJ/edit
The dataset consists of incident ticket information assigned to specific groups. It has 8500 rows and 4 columns.
The target column 'Assignment group' has 74 distinct values.
!pip install langdetect
!pip install Unidecode
!pip install googletrans
!pip install spacy
!pip install plotly
!pip install xlrd
!pip install wordcloud
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
#Import the necessary libraries
import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(color_codes = True)
from pandas import DataFrame
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize,sent_tokenize
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from matplotlib import pyplot as plt
import string
import unidecode
import re
import spacy
from keras.regularizers import L1L2
from tensorflow.keras import regularizers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional,MaxPooling1D ,SpatialDropout1D
from tensorflow.keras.models import Model, Sequential
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn import preprocessing
from tensorflow.keras.callbacks import Callback, EarlyStopping, ModelCheckpoint
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, Flatten
from tensorflow.keras.layers import GlobalAveragePooling1D, Embedding, LSTM
from tensorflow.keras.models import Model
from langdetect import detect_langs
from langdetect import detect
from sklearn.svm import SVC, LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix, classification_report, auc
from sklearn.metrics import roc_curve, accuracy_score, precision_recall_curve,f1_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from langdetect import detect_langs
from langdetect import detect
import gensim
from gensim.models.phrases import Phraser, Phrases
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
import zipfile
import datetime
import sys
from tqdm import tqdm
tqdm.pandas()
import gc
import os
from google.colab import drive
drive.mount('/content/drive')
project_path = '/content/drive/My Drive/CapstoneProject/'
#Reading the data to a dataframe for further processing
tickets_corpus = pd.read_excel(project_path + 'input_data.xlsx')
#
#Displaying the first 5 rows from the data
tickets_corpus.head(5)
#Displaying the last 10 rows from the dataset
tickets_corpus.tail(10)
We have successfully read the data and stored it in a dataframe.
#Shape of the dataset
tickets_corpus.shape
tickets_corpus.info()
tickets_corpus.describe()
The dataset comprises 8500 rows and 4 columns. All columns are of type object, containing textual information. 'Password reset' is one of the most frequently occurring tickets, which is reflected in the Short description column. The most frequent Description value is 'the', which is meaningless and will need handling. We can also see that the top caller is 'bpctwhsn kzqsbmtp' and the most frequent assignment group is GRP_0.
#Checking if the data set has any NULL OR NAN Values
tickets_corpus.isna().sum()
We have very few NaN values in the dataset, in the 'Short description' and 'Description' columns.
#displaying the data where 'Description' is null
tickets_corpus[tickets_corpus['Description'].isna()]
#displaying the data where 'Short description' is null
tickets_corpus[tickets_corpus['Short description'].isna()]
As per the above analysis, wherever the 'Description' column has a missing value, the corresponding 'Short description' value is present, and vice versa. Since we are going to merge these two columns later, deleting these NaN rows is not strictly necessary. However, most of them belong to GRP_0, towards which the data is already heavily biased, so we drop these rows anyway.
# We will Drop these rows from the dataset
tickets_corpus.dropna(inplace=True)
print ('Shape of the dataset after Dropping NAN values', tickets_corpus.shape)
'Assignment group' column
#Lets check the assignment groups and the corresponding ticket count in each group
group_ids = tickets_corpus['Assignment group'].str.split(expand=True).stack().value_counts()
print ('Number of Incidents under each unique incident group type\n', group_ids)
There are 74 groups in total, and GRP_0 has the highest number of tickets: 3968 out of 8500.
#Lets Select all ticket Assignment groups which have only one ticket
print (tickets_corpus[tickets_corpus.groupby("Assignment group")["Assignment group"].transform('size') == 1]['Assignment group'].unique())
We have around 6 assignment groups that contain only a single ticket sample.
Short description Column
#Length of each 'Short description'
tickets_corpus['short_desc_len'] = tickets_corpus['Short description'].astype(str).apply(len)
#Lets get the number of words in each 'Short description'
tickets_corpus['short_des_word_count'] = tickets_corpus['Short description'].apply(lambda x: len(str(x).split()))
tickets_corpus.head()
print ('Maximum length of single record in Short Description ', tickets_corpus['short_desc_len'].max())
print ('Minimum length of single record in Short Description ', tickets_corpus['short_desc_len'].min())
print ('Average length of single record in Short Description', tickets_corpus['short_desc_len'].mean())
print ('Maximum Word count of single record of Short Description', tickets_corpus['short_des_word_count'].max())
print ('Minimum Word count of single record of Short Description', tickets_corpus['short_des_word_count'].min())
print ('Average Word count of single record of Short Description', tickets_corpus['short_des_word_count'].mean())
#Total words in the 'Short Description'
short_des_all_words = list(tickets_corpus['Short description'].str.lower().str.split(' ', expand=True).stack().unique())
print ('Total words in Short Description Column', len(short_des_all_words))
Let's see the top 5 longest Short descriptions!
pd.set_option('display.max_colwidth',None) # To display full length value of columns
tickets_corpus[["Short description","short_des_word_count"]].sort_values(by="short_des_word_count",ascending=False).head(5)
Description Column
# Length of each description
tickets_corpus['Desc_len'] = tickets_corpus['Description'].astype(str).apply(len)
# we are temporarily creating a column in the dataframe for the number of words
tickets_corpus['Des_word_count'] = tickets_corpus['Description'].apply(lambda x: len(str(x).split(" ")))
tickets_corpus.head(5)
print ('Maximum length of single record of Description', tickets_corpus['Desc_len'].max())
print ('Minimum length of single record of Description', tickets_corpus['Desc_len'].min())
print ('Average length of single record of Description', tickets_corpus['Desc_len'].mean())
print ('Maximum Word count of single record of Description', tickets_corpus['Des_word_count'].max())
print ('Minimum Word count of single record of Description', tickets_corpus['Des_word_count'].min())
print ('Average Word count of single record of Description', tickets_corpus['Des_word_count'].mean())
#Total words in the 'Description' column
des_all_words = list(tickets_corpus['Description'].str.lower().str.split(' ', expand=True).stack().unique())
print ('Total words in Description Column', len(des_all_words))
Let's see the top 5 longest descriptions!
tickets_corpus[["Description","Des_word_count"]].sort_values(by="Des_word_count",ascending=False).head(5)
Analyzing Caller Column
Check the number of unique callers in the dataset
#Removing space between Caller Full Name to count unique callers
tickets_corpus['Caller']= tickets_corpus['Caller'].replace(" ","", regex=True)
Unique_Callers = tickets_corpus['Caller'].value_counts()
print ('Number of Unique callers in the Dataset', len(Unique_Callers))
Let's see the top 10 callers by number of tickets raised.
top_callers = tickets_corpus.groupby(['Caller']).size().nlargest(10)
print(top_callers)
One caller alone raised 810 tickets; every other caller raised fewer than 200 tickets.
Let's check whether any caller raised tickets for multiple groups.
top_c = tickets_corpus['Caller'].groupby(tickets_corpus['Assignment group']).value_counts()
grp_caller =pd.DataFrame(top_c.groupby(level=0).nlargest(5).reset_index(level=0, drop=True))
multy_caller = grp_caller[grp_caller.index.get_level_values(1).duplicated(keep=False)]
grp_caller.head(20)
multy_caller_unique = multy_caller.index.get_level_values(1).unique().tolist()
multy_caller_unique
The above callers have raised tickets for multiple groups.
As per the above analysis, we do not see any significant relationship between the 'Caller' and the group to which tickets are assigned, so we can drop this column when building the model.
init_notebook_mode(connected=True)
all_words = tickets_corpus['Description'].str.split(expand=True).unstack().value_counts()
data = [go.Bar(
x = all_words.index.values[2:50],
y = all_words.values[2:50],
marker= dict(colorscale='Viridis',
color = all_words.values[2:100]
),
text='Word counts'
)]
layout = go.Layout(
title='Frequently Occurring Words (uncleaned) in Description'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='basic-bar')
all_words = tickets_corpus['Short description'].str.split(expand=True).unstack().value_counts()
data = [go.Bar(
x = all_words.index.values[2:50],
y = all_words.values[2:50],
marker= dict(colorscale='Viridis',
color = all_words.values[2:100]
),
text='Word counts'
)]
layout = go.Layout(
title='Frequently Occurring Words (uncleaned) in Short Description'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='basic-bar')
plt.figure(figsize=(20,12))
tickets_corpus["Assignment group"].value_counts().plot.pie(autopct='%1.2f%%', fontsize=10, startangle=90)
From the above plot we can see that 46.73% of the data belongs to GRP_0.
Some important insights from the plot: groups 0, 8, 24, 12, 9, 2 and 19 have the highest number of cases tagged. The data is highly biased towards GRP_0 incidents.
ax = tickets_corpus.hist(column='short_des_word_count', bins=25, grid=False, figsize=(8,6), color='#86bf91', zorder=2, rwidth=0.9)
ax = ax[0]
for x in ax:
    # Set x-axis label
    x.set_xlabel("Short Description word count", labelpad=20, weight='bold', size=12)
    # Set y-axis label
    x.set_ylabel("Count", labelpad=20, weight='bold', size=12)
ax = tickets_corpus.hist(column='Des_word_count', bins=25, grid=False, figsize=(8,6), color='#86bf91', zorder=2, rwidth=0.9)
ax = ax[0]
for x in ax:
    # Set x-axis label
    x.set_xlabel("Description Word count", labelpad=20, weight='bold', size=12)
    # Set y-axis label
    x.set_ylabel("Count", labelpad=20, weight='bold', size=12)
    x.set_title("Description word count")
plt.figure(figsize=(22,10))
sns.set_style("whitegrid")
sns.countplot(x="Assignment group", data=tickets_corpus)
plt.xticks(rotation=90)
plt.title("Frequency of Assignment groups",fontsize=20)
plt.xlabel("Assignment groups",fontsize=8)
plt.ylabel("No.of tickets",fontsize=8)
tickets_corpus.columns
# Merge the Short description and Description column texts to create a new column
tickets_corpus.insert(loc=8,
column='ticket_summary',
allow_duplicates=True,
value=list(tickets_corpus['Short description'].str.strip() + ' ' + tickets_corpus['Description'].str.strip()))
#check the merged column is created properly or not
tickets_corpus['ticket_summary'].head()
# Detect the language of each merged summary using the langdetect package
tickets_corpus['Language'] = tickets_corpus['ticket_summary'].apply(detect)
print ('Various languages detected includes', tickets_corpus.groupby(['Language']).size())
print ('Total number of records', len(tickets_corpus['Language']))
print ('Non-English records per column:\n', tickets_corpus[~tickets_corpus['Language'].str.contains("en", na=False)].count())
In total, 28 languages are present in the dataset; the majority are English, Dutch, Afrikaans and French. Around 1393 of the 7647 records are in a language other than English. For now, we are not handling these records.
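If we later decided to keep only the English records, the detected 'Language' column makes the filter a one-liner. A minimal sketch on an invented toy frame (the rows below are not from the real dataset):

```python
import pandas as pd

# Toy frame mimicking the detected 'Language' column (rows here are invented).
df = pd.DataFrame({
    'ticket_summary': ['password reset please', 'kann nicht drucken', 'vpn not connecting'],
    'Language': ['en', 'de', 'en'],
})

# Keep only the rows detected as English.
english_only = df[df['Language'] == 'en']
print(english_only.shape)  # (2, 2)
```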
Preprocessing the text simply means bringing it into a form that is predictable and analyzable.
There are different ways to preprocess text. Here are some of the approaches we followed:
Convert to lowercase
tickets_corpus['ticket_summary'] = tickets_corpus['ticket_summary'].apply(lambda x: str(x).lower())
Text Cleaning: Removing unwanted characters, special symbols, and tags.
def getList():
    """Prepare a list of regex patterns matching unnecessary tags, special characters and unhelpful words in our data."""
    rmvList = []
    rmvList += [r'received from:(.*)'] # received-from line
    rmvList += [r'From:(.*)'] # from line
    rmvList += [r'Sent:(.*)'] # sent line
    rmvList += [r'To:(.*)'] # to line
    rmvList += [r'CC:(.*)'] # cc line
    rmvList += [r'https?:[^\]\n\r]+'] # http(s) URLs
    rmvList += [r'[\r\n]'] # carriage returns / newlines
    rmvList += [r'[^a-zA-Z\s]'] # anything that is not a letter or whitespace
    rmvList += ['sid_']
    rmvList += ['erp ']
    return rmvList
def cleanDataset(col, rmvList):
    """Clean a text column by removing every pattern returned by getList()."""
    for ex in rmvList:
        col = col.str.replace(ex.lower(), '', regex=True)
    return col
#just check for one sample
tickets_corpus.loc[[21]]['ticket_summary']
print(cleanDataset(tickets_corpus.loc[[21]]['ticket_summary'], getList()))
#Lets apply Cleaning to entire data
tickets_corpus['ticket_summary'] = cleanDataset(tickets_corpus['ticket_summary'], getList())
tickets_corpus['ticket_summary'].head()
Removing Punctuation
Punctuation is removed since it can hinder the following preprocessing steps.
# !"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~`
PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return str(text).translate(str.maketrans('', '', PUNCT_TO_REMOVE))
tickets_corpus["ticket_summary"] = tickets_corpus["ticket_summary"].apply(lambda text: remove_punctuation(text))
Removing Stopwords
Stopwords are very common words. Words like "we" and "are" generally do not help in NLP tasks such as sentiment analysis or text classification, so we can remove them to save computing time and effort when processing large volumes of text. We use the stopword list from nltk, extended with additional words specific to our corpus.
#we are using stopwords from nltk
nltk.download('stopwords')
Extending stopwords according to our corpus and removing all stopwords from data
STOPWORDS = stopwords.words('english')
STOPWORDS.extend(["sr", "psa", "perpsr", "psa", "good", "evening", "will", "night", "afternoon", "png", "mailto", "ca", "nt", "at", "i", "vip", "llv", "xyz",
"cid", "image", "gmail","co", "in", "com", "ticket", "company", "received", "0o", "0s", "3a", "3b", "3d", "6b", "6o", "a", "A", "a1", "a2",
"a3", "a4", "ab", "able", "about", "above", "abst", "ac", "accordance", "according", "accordingly", "across", "act", "actually", "ad",
"added", "adj", "ae", "af", "affected", "affecting", "after", "afterwards", "ag", "again", "against", "ah", "ain", "aj", "al", "all",
"allow", "allows", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount",
"an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anyway", "anyways", "anywhere", "ao", "ap", "apart",
"apparently", "appreciate", "approximately", "ar", "are", "aren", "arent", "arise", "around","articl", "as", "aside", "ask", "asking", "at", "au",
"auth", "av", "available", "aw", "away", "awfully", "ax", "ay", "az", "b", "B", "b1", "b2", "b3", "ba", "back", "bc", "bd", "be", "became",
"been", "before", "beforehand", "beginnings", "behind", "below", "beside", "besides", "best", "between", "beyond", "bi", "bill", "biol",
"bj", "bk", "bl", "bn", "both", "bottom", "bp", "br", "brief", "briefly", "bs", "bt", "bu", "but", "bx", "by", "c", "C", "c1", "c2", "c3",
"ca", "call", "came", "can", "cc", "cd", "ce", "certain", "certainly", "cf", "cg", "ch", "ci", "cit", "cj", "cl", "clearly", "cm", "cn",
"co", "com", "come", "comes", "con", "concerning", "consequently", "consider", "considering", "could", "couldn", "couldnt", "course",
"cp", "cq", "cr", "cry", "cs", "ct", "cu", "cv", "cx", "cy", "cz", "d", "D", "d2", "da", "date", "dc", "dd", "de", "definitely",
"describe", "described", "despite", "detail", "df", "di", "did", "didn", "dj", "dk", "dl", "do", "does", "doesn", "doing", "don",
"done", "down", "downwards", "dp", "dr", "ds", "dt", "du", "due", "during", "dx", "dy", "e", "E", "e2", "e3", "ea", "each", "ec",
"ed", "edu", "ee", "ef", "eg", "ei", "eight", "eighty", "either", "ej", "el", "eleven", "else", "elsewhere", "em", "en", "end", "ending",
"enough", "entirely", "eo", "ep", "eq", "er", "es", "especially", "est", "et", "et-al", "etc", "eu", "ev", "even", "ever", "every",
"everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "ey", "f", "F", "f2", "fa", "far", "fc", "few",
"ff", "fi", "fifteen", "fifth", "fify", "fill", "find", "fire", "five", "fix", "fj", "fl", "fn", "fo", "followed", "following", "follows",
"for", "former", "formerly", "forth", "forty", "found", "four", "fr", "from", "front", "fs", "ft", "fu", "full", "further", "furthermore",
"fy", "g", "G", "ga", "gave", "ge", "get", "gets", "getting", "gi", "give", "given", "gives", "giving", "gj", "gl", "go", "goes", "going",
"gone", "got", "gotten", "gr", "greetings","greeting", "gs", "gy", "h", "H", "h2", "h3", "had", "hadn", "happens", "hardly", "has", "hasn", "hasnt",
"have", "haven", "having", "he", "hed", "hi","hello", "help", "hence", "here", "hereafter", "hereby", "herein", "heres", "hereupon", "hes",
"hh", "hi", "hid", "hither", "hj", "ho", "hopefully", "how", "howbeit", "however", "hs", "http", "hu", "hundred", "hy", "i2", "i3", "i4",
"i6", "i7", "i8", "ia", "ib", "ibid", "ic", "id", "ie", "if", "ig", "ignored", "ih", "ii", "ij", "il", "im", "immediately", "in",
"inasmuch", "inc", "indeed", "index", "indicate", "indicated", "indicates", "information", "inner", "insofar", "instead", "interest",
"into", "inward", "io", "ip", "iq", "ir", "is", "isn", "it", "itd", "its", "iv", "ix", "iy", "iz", "j", "J", "jj", "jr", "js",
"jt", "ju", "just", "k", "K", "ke", "keep", "keeps", "kept", "kg", "kj", "km", "ko", "l", "L", "l2", "la", "largely", "last",
"lately", "later", "latter", "latterly", "lb", "lc", "le", "least", "les", "less", "lest", "let", "lets", "lf", "like", "liked",
"likely", "line", "little", "lj", "ll", "ln", "lo", "look", "looking", "looks", "los", "lr", "ls", "lt", "ltd", "m", "M", "m2",
"ma", "made", "mainly", "make", "makes", "many", "may", "maybe", "me", "meantime", "meanwhile", "merely", "mg", "might", "mightn",
"mill", "million", "mine", "miss", "ml", "mn", "mo", "more", "moreover", "most", "mostly", "move", "mr", "mrs", "ms", "mt", "mu",
"much", "mug", "must", "mustn", "my", "n", "N", "n2", "na", "name", "namely", "nay", "nc", "nd", "ne", "near", "nearly", "necessarily",
"neither", "nevertheless", "new", "next", "ng", "ni", "nine", "ninety", "nj", "nl", "nn", "nobody", "non", "none", "nonetheless", "noone",
"normally", "nos", "noted", "novel", "now", "nowhere", "nr", "ns", "nt", "ny", "o", "O", "oa", "ob", "obtain", "obtained", "obviously",
"oc", "od", "of", "off", "often", "og", "oh", "oi", "oj", "ok", "okay", "ol", "old", "om", "omitted", "on", "once", "one", "ones",
"only", "onto", "oo", "op", "oq", "or", "ord", "os", "ot", "otherwise", "ou", "ought", "our", "out", "outside", "over", "overall",
"ow", "owing", "own", "ox", "oz", "p", "P", "p1", "p2", "p3", "page", "pagecount", "pages", "par", "part", "particular", "particularly",
"pas", "past", "pc", "pd", "pe", "per", "perhaps", "pf", "ph", "pi", "pj", "pk", "pl", "placed", "please", "plus", "pm", "pn", "po",
"poorly", "pp", "pq", "pr", "predominantly", "presumably", "previously", "primarily", "probably", "promptly", "proud", "provides", "ps",
"pt", "pu", "put", "py", "q", "Q", "qj", "qu", "que", "quickly", "quite", "qv", "r", "R", "r2", "ra", "ran", "rather", "rc", "rd", "re",
"readily", "really", "reasonably", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively",
"research", "respectively", "resulted", "resulting", "results", "rf", "rh", "ri", "right", "rj", "rl", "rm", "rn", "ro", "rq",
"rr", "rs", "rt", "ru", "run", "rv", "ry", "s", "S", "s2", "sa", "said", "saw", "say", "saying", "says", "sc", "sd", "se", "sec", "second",
"secondly", "section", "seem", "seemed", "seeming", "seems", "seen", "sent", "seven", "several", "sf", "shall", "shan", "shed", "shes",
"show", "showed", "shown", "showns", "shows", "si", "side", "since", "sincere", "six", "sixty", "sj", "sl", "slightly", "sm", "sn", "so",
"some", "somehow", "somethan", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "sp", "specifically", "specified",
"specify", "specifying", "sq", "sr", "ss", "st", "still", "stop", "strongly", "sub", "substantially", "successfully", "such",
"sufficiently", "suggest", "sup", "sure", "sy", "sz", "t", "T", "t1", "t2", "t3", "take", "taken", "taking", "tb", "tc", "td", "te",
"tell", "ten", "tends", "tf", "th", "than", "thank", "thanks", "thanx", "that", "thats", "the", "their", "theirs", "them", "themselves",
"then", "thence", "there", "thereafter", "thereby", "thered", "therefore", "therein", "thereof", "therere", "theres", "thereto",
"thereupon", "these", "they", "theyd", "theyre", "thickv", "thin", "think", "third", "this", "thorough", "thoroughly", "those", "thou",
"though", "thoughh", "thousand", "three", "throug", "through", "throughout", "thru", "thus", "ti", "til", "tip", "tj", "tl", "tm", "tn",
"to", "together", "too", "took", "top", "toward", "towards", "tp", "tq", "tr", "tried", "tries", "truly", "try", "trying", "ts", "tt",
"tv", "twelve", "twenty", "twice", "two", "tx", "u", "U", "u201d", "ue", "ui", "uj", "uk", "um", "un", "under", "unfortunately", "unless",
"unlike", "unlikely", "until", "unto", "uo", "up", "upon", "ups", "ur", "us", "used", "useful", "usefully", "usefulness", "using",
"usually", "ut", "v", "V", "va", "various", "vd", "ve", "very", "via", "viz", "vj", "vo", "vol", "vols", "volumtype", "vq", "vs", "vt",
"vu", "w", "W", "wa", "was", "wasn", "wasnt", "way", "we", "wed", "welcome", "well", "well-b", "went", "were", "weren", "werent", "what",
"whatever", "whats", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "wheres", "whereupon",
"wherever", "whether", "which", "while", "whim", "whither", "who", "whod", "whoever", "whole", "whom", "whomever", "whos", "whose",
"why", "wi", "widely", "with", "within", "without", "wo", "won", "wonder", "wont", "would", "wouldn", "wouldnt", "www", "x", "X",
"x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx", "y", "Y", "y2", "yes", "yet", "yj", "yl", "you",
"youd", "your", "youre", "yours", "yr", "ys", "yt", "z", "Z", "zero", "zi", "zz"])
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
tickets_corpus["ticket_summary"] = tickets_corpus["ticket_summary"].apply(lambda text: remove_stopwords(text))
tickets_corpus["ticket_summary"].head(10)
Convert Accented Characters
Words with accent marks like "latté" and "café" can be converted and standardized to just "latte" and "cafe"; otherwise our NLP model will treat "latté" and "latte" as different words even though they refer to the same thing. To do this, we use the unidecode module.
def remove_accented_chars(text):
    """remove accented characters from text, e.g. café -> cafe"""
    return unidecode.unidecode(text)
tickets_corpus['ticket_summary'] = tickets_corpus["ticket_summary"].apply(lambda text: remove_accented_chars(text))
tickets_corpus['ticket_summary'].head(10)
Lemmatization
Lemmatization is the process of converting a word to its base form, e.g., “caring” to “care”. We use spaCy’s lemmatizer to obtain the lemma, or base form, of the words.
import en_core_web_sm
nlp = en_core_web_sm.load()
#nlp = spacy.load('en_core_web_md')
# function to lemmatize the descriptions
def lemmatize(sentence):
    spacy_doc = nlp(sentence)  # parse the sentence using the loaded spaCy model `nlp`
    return " ".join([token.lemma_ for token in spacy_doc if token.lemma_ != '-PRON-'])
# Apply the Lemmatization to ticket_summary
tickets_corpus['ticket_Desc_lemm'] = tickets_corpus['ticket_summary'].apply(lemmatize)
# Verify the data after lemmatization
tickets_corpus['ticket_Desc_lemm'].tail(10)
A word cloud is another helpful visualization tool. The wordcloud package creates word clouds by placing words on a canvas randomly, with sizes proportional to their frequency in the text.
text = (tickets_corpus[tickets_corpus['Assignment group'] == 'GRP_0']['ticket_Desc_lemm']).to_string(index=False)
wordcloud = WordCloud().generate(text)
# plot the WordCloud image
plt.figure(figsize = (15, 12), facecolor = None)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
As we can see from the above word cloud, most of the tickets are related to 'password reset', 'account lock', 'outlook issues', 'unable to login', 'internet access', 'email issues', 'skype issues', etc.
text = (tickets_corpus['ticket_Desc_lemm']).to_string(index=False)
wordcloud = WordCloud().generate(text)
# plot the WordCloud image
plt.figure(figsize = (15, 12), facecolor = None)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Looking at the ticket descriptions overall, most tickets concern job scheduler failures, password resets, account locks, circuit issues, etc.
With all preprocessing steps complete, let's move on to model building. First, let's experiment with some classifier algorithms and compare their performance.
Overview of this step:
Building a model architecture that can classify the tickets.
Trying different model architectures by researching the state of the art for similar tasks.
Training the model.
To cope with long training times, saving the weights so that a later training run can resume without starting from scratch.
Let's experiment with different algorithms such as:
Multinomial Naive Bayes
K Nearest neighbor
Support Vector Machine
Decision Tree
Random Forest
LSTM
Raw text, being a sequence of symbols, cannot be fed directly to these algorithms: most of them expect numerical feature vectors of fixed size rather than raw documents of variable length.
Before creating the above classifier models, let's first vectorize our input data.
Scikit-learn's CountVectorizer transforms a corpus of text into vectors of term/token counts. It can also preprocess the text before generating the vector representation, making it a highly flexible feature-extraction module for text.
TF-IDF, or Term Frequency (TF) - Inverse Document Frequency (IDF), is a technique for weighting words by how informative they are, compensating for a shortcoming of the plain bag-of-words representation. We use it for our baseline classification models.
In text analysis with machine learning, TF-IDF helps sort data into categories as well as extract keywords. This means that simple, monotonous tasks, like tagging support tickets or rows of feedback, can be done in seconds.
X_train, X_test, y_train, y_test = train_test_split(tickets_corpus['ticket_Desc_lemm'], tickets_corpus['Assignment group'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train.shape,y_train.shape,X_test.shape,y_test.shape
The below classifiers are run and compared:
--Multinomial Naive Bayes
--K Nearest neighbor
--Support Vector Machine
--Decision Tree
--Random Forest
--LSTM
Multinomial Naive Bayes
Naive Bayes is a family of algorithms based on applying Bayes' theorem with the strong (naive) assumption that every feature is independent of the others, in order to predict the category of a given sample. They are probabilistic classifiers: they calculate the probability of each category using Bayes' theorem and output the category with the highest probability. Naive Bayes classifiers have been successfully applied to many domains, particularly NLP.
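A minimal sketch of this idea on a hypothetical two-group toy corpus (the texts and group names below are invented, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-group toy corpus (invented for illustration).
texts = ['password reset', 'account locked password', 'printer jam', 'printer out of paper']
groups = ['GRP_ACCESS', 'GRP_ACCESS', 'GRP_PRINT', 'GRP_PRINT']

vect = CountVectorizer()
X = vect.fit_transform(texts)
clf = MultinomialNB().fit(X, groups)

# The class with the highest posterior probability is returned.
query = vect.transform(['forgot my password'])
print(clf.predict(query))        # ['GRP_ACCESS']
print(clf.predict_proba(query))  # one probability per group
```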
Advantages:
Disadvantages:
clf = MultinomialNB().fit(X_train_tfidf, y_train)
y_train_pred_NB = clf.predict(tfidf_transformer.transform(count_vect.transform(X_train)))
y_test_pred_NB = clf.predict(tfidf_transformer.transform(count_vect.transform(X_test)))
print("Multinomial Naive Bayes :")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_NB) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_NB) * 100))
K Nearest Neighbor
KNN is a non-parametric, lazy learning algorithm. Non-parametric means it makes no assumption about the underlying data distribution, which is helpful in practice, where most real-world datasets do not follow theoretical assumptions. Lazy means it builds no model at training time: all training data is used during the testing phase.
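A minimal sketch of the same idea on invented toy tickets; note that fitting KNN only stores the vectors, and all the work happens at prediction time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Invented toy tickets; KNN simply memorises these vectors ("lazy" learning).
texts = ['reset password', 'password expired', 'printer offline', 'printer toner empty']
groups = ['GRP_ACCESS', 'GRP_ACCESS', 'GRP_PRINT', 'GRP_PRINT']

vect = TfidfVectorizer()
X = vect.fit_transform(texts)

# The 3 nearest stored vectors vote on the class of each query.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, groups)
print(knn.predict(vect.transform(['printer is offline'])))  # ['GRP_PRINT']
```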
Advantages:
Disadvantages:
clf_knn = KNeighborsClassifier(n_neighbors=7,weights='uniform').fit(X_train_tfidf, y_train)
y_train_pred_knn = clf_knn.predict(tfidf_transformer.transform(count_vect.transform(X_train)))
y_test_pred_knn = clf_knn.predict(tfidf_transformer.transform(count_vect.transform(X_test)))
print("K Nearest Neighbours :")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_knn) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_knn) * 100))
Support Vector Machine
SVM (Support Vector Machine) separates the data with a hyperplane that acts as a decision boundary between classes. The data points closest to the hyperplane are called support vectors; they define the separating boundary. SVM tries to find the optimal hyperplane that maximises the margin to the support vectors of each class.
The linear kernel is often recommended for text classification.
Advantages:
Disadvantages:
clf_svc = LinearSVC().fit(X_train_tfidf, y_train)
# apply the same count + tf-idf transforms that were used at training time
y_train_pred_svc = clf_svc.predict(tfidf_transformer.transform(count_vect.transform(X_train)))
y_test_pred_svc = clf_svc.predict(tfidf_transformer.transform(count_vect.transform(X_test)))
print("Support Vector Machine :")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_svc) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_svc) * 100))
Decision Tree
A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is the root node. The tree learns to partition the data on the basis of attribute values, splitting recursively in a process called recursive partitioning. This flowchart-like structure mimics human-level decision making, which is why decision trees are easy to understand and interpret.
Advantages:
Disadvantages:
clf_tree = DecisionTreeClassifier().fit(X_train_tfidf, y_train)
# apply the same count + tf-idf transforms that were used at training time
y_train_pred_tree = clf_tree.predict(tfidf_transformer.transform(count_vect.transform(X_train)))
y_test_pred_tree = clf_tree.predict(tfidf_transformer.transform(count_vect.transform(X_test)))
print("Decision Tree Classifier :")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_tree) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_tree) * 100))
RandomForest Classifier
Due to its algorithmic simplicity and strong classification performance on high-dimensional data, random forest has become a promising method for text categorization. A random forest is a popular classification method that is an ensemble of classification trees.
Advantages:
Disadvantages:
clf_rand = RandomForestClassifier(n_estimators=100).fit(X_train_tfidf, y_train)
# apply the same count + tf-idf transforms that were used at training time
y_train_pred_rand = clf_rand.predict(tfidf_transformer.transform(count_vect.transform(X_train)))
y_test_pred_rand = clf_rand.predict(tfidf_transformer.transform(count_vect.transform(X_test)))
print("RandomForest Classifier:")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_rand) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_rand) * 100))
Comparing Classification Models
Each algorithm is evaluated with 10-fold cross validation, configured with the same random seed so that the same splits of the training data are used and every algorithm is evaluated in precisely the same way. Each algorithm is given a short name, useful for summarising the results afterwards.
# Comparing models
models = []
models.append(('MNB', MultinomialNB()))
models.append(('KNN', KNeighborsClassifier(n_neighbors=7)))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFC', RandomForestClassifier(n_estimators=100)))
models.append(('SVM', LinearSVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # same seed for every algorithm so they all see the same splits
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
    cv_results = model_selection.cross_val_score(model, X_train_tfidf, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
LSTM stands for Long Short-Term Memory. An LSTM module (or cell) has 5 essential components which allow it to model both long-term and short-term dependencies in the data. LSTM is a special type of RNN that preserves long-term dependencies more effectively than basic RNNs. It is particularly good at overcoming the vanishing gradient problem, because it uses multiple gates to carefully regulate the amount of information allowed into each node state. At its core, an LSTM preserves information from inputs that have already passed through it using the hidden state. A unidirectional LSTM only preserves information from the past, because the only inputs it has seen are from the past.
Advantages:
We use a bidirectional LSTM neural network for the classification. Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems. They train two LSTMs instead of one on the input sequence: the first on the sequence as-is and the second on a reversed copy. This can provide additional context to the network and result in faster and fuller learning on the problem.
texts = tickets_corpus['ticket_Desc_lemm'].values
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
tickets_corpus['token_text_vocab'] = tokenizer.texts_to_sequences(texts)
vocab_words = tokenizer.word_index.items()
len(vocab_words)
#Get the vocabulary size
num_words = len(tokenizer.word_index) +1
print (num_words)
#To view the 10 elements from dictionary
from itertools import islice
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))
take(10, vocab_words)
maxlen=300
max_features = 10000
X = tokenizer.texts_to_sequences(tickets_corpus['ticket_Desc_lemm'])
X = pad_sequences(X, padding='post',maxlen = maxlen)
# Converting categorical labels to numbers.
y = pd.get_dummies(tickets_corpus['Assignment group']).values
print("Number of Samples:", len(X))
print("Number of Labels: ", len(y))
print("X[0] = ",X[0])
print("y[0] = ",y[0])
Get embeddings using the pre-trained GloVe model
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
The advantage of GloVe is that, unlike Word2vec, GloVe does not rely just on local statistics (local context information of words), but incorporates global statistics (word co-occurrence) to obtain word vectors.
Here we are using 'glove.6B.200d.txt' file which is trained on a corpus of 6 billion tokens and contains a vocabulary of 400 thousand tokens.
glove_file = project_path + "glove.6B.zip"
#Extract Glove embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
    z.extractall()
#Get the Word Embeddings using Embedding file
EMBEDDING_FILE = './glove.6B.200d.txt'
embeddings = {}
for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore'):
    word = o.split(" ")[0]
    embd = o.split(" ")[1:]
    embd = np.asarray(embd, dtype='float32')
    embeddings[word] = embd
len(embeddings.values())
#Just checking the sample embeddings for the word 'outlook' which is from our corpus
embeddings['outlook']
This is the 200-dimensional word embedding for the word 'outlook'.
#Create a weight matrix for words in training docs
embedding_matrix = np.zeros((num_words, 200))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
We have created the embedding vector for all the words in our vocabulary.
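Words absent from GloVe keep an all-zero row in the weight matrix. As a quick sanity check, the coverage of the vocabulary can be measured by counting non-zero rows; this is a self-contained sketch (the helper name and the toy matrix are illustrative, not part of the project code):

```python
import numpy as np

def glove_coverage(embedding_matrix):
    """Fraction of vocabulary rows (index 0 is the padding row) that
    received a non-zero pre-trained vector."""
    rows = embedding_matrix[1:]                        # skip padding row 0
    hits = np.count_nonzero(np.any(rows != 0, axis=1))
    return hits / len(rows)

# Toy check: 3-word vocabulary, one word missing from GloVe (all-zero row).
toy = np.array([[0., 0.], [1., 2.], [0., 0.], [3., 4.]])
print("coverage = %.2f" % glove_coverage(toy))  # 2 of 3 rows are non-zero
```

The same call on the real `embedding_matrix` built above would show how much of the ticket vocabulary GloVe actually covers.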
After all the above data transformation, now that we have all the features and labels, it is time to train the classifiers. There are a number of algorithms we can use for this type of problem.
Split the dataset for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Now let's create the bidirectional LSTM model.
A bidirectional layer runs the inputs in two directions, one from past to future and one from future to past. Unlike the unidirectional case, the LSTM that runs backwards preserves information from the future, so by combining the two hidden states the network can, at any point in time, draw on information from both past and future.
#parameters used
epochs = 20
batch_size = 60
embedding_size = 200
model = Sequential()
model.add(Embedding(input_dim=num_words,
output_dim=embedding_size,
weights=[embedding_matrix],
input_length=maxlen,
# mask_zero=True,
trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(74, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
### save model checkpoints during training so the weights can be reloaded later
output_dir = 'model_output/LSTM'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
modelcheckpoint = ModelCheckpoint(filepath=output_dir+"/weights.{epoch:02d}.hdf5")
history = model.fit(X_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_data=(X_test, y_test),
callbacks=[modelcheckpoint,EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
model.load_weights(output_dir+"/weights.14.hdf5") # load the checkpointed weights from epoch 14
y_pred = model.predict(X_test)
Accuracy of the model
acc_test =model.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])
acc_train =model.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
Plot the Accuracy of the classifier
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
Plot the Loss of the Classifier
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
Confusion Matrix
A confusion matrix is a technique for summarising the performance of a classification algorithm. The numbers of correct and incorrect predictions are summarised with count values, broken down by class. This shows where the classification model is confused when it makes predictions, giving insight not only into the errors being made but, more importantly, into the types of errors being made.
conf_mat = confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
#fig, ax = plt.subplots(figsize=(20,20))
plt.figure(figsize=(22,22))
# label order must match pd.get_dummies' sorted one-hot column order
class_labels = pd.get_dummies(tickets_corpus['Assignment group']).columns
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=class_labels, yticklabels=class_labels)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.
Many assignment groups are not present in the test data. The diagonal element value for GRP_0 is high.
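Per-class recall can be read directly off the confusion matrix: each diagonal count divided by its row sum. A self-contained sketch on a toy 3-class matrix (the helper name is illustrative; the same computation applies to `conf_mat` above):

```python
import numpy as np

def per_class_recall(conf_mat):
    """Diagonal count / row sum: fraction of each true class predicted correctly."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    row_sums = conf_mat.sum(axis=1)
    # guard against classes absent from the test set (all-zero rows)
    return np.divide(np.diag(conf_mat), row_sums,
                     out=np.zeros_like(row_sums), where=row_sums > 0)

# Toy 3-class example: class 0 dominates, class 2 absent from the test set.
toy = [[8, 2, 0],
       [3, 7, 0],
       [0, 0, 0]]
print(per_class_recall(toy))  # recall per class: 0.8, 0.7, 0.0
```

Printing this vector for the real matrix would make the GRP_0 bias visible as one high value among many low or zero ones.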
Classification Reports
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
Evaluation comparison of the above classifier models:
Of all the models we have tried, Support Vector Machine and RandomForestClassifier perform better than the others. However, these models are highly overfitted, and one obvious reason is that the dataset is highly imbalanced.
The Accuracy of the Models:
| Algorithm                | Train_Accuracy | Test_Accuracy |
|--------------------------|----------------|---------------|
| Multinomial NB           | 62.52          | 61.28         |
| K Nearest Neighbours     | 66.87          | 64.67         |
| Support Vector Machine   | 91.32          | 67.73         |
| Decision Tree Classifier | 63.27          | 50.82         |
| RandomForest Classifier  | 84.22          | 64.48         |
| Bidirectional LSTM       | 75.10          | 63.39         |
LSTM is effective at dealing with textual data, and bidirectional LSTMs can further improve performance on classification problems: running the inputs in both directions lets the network preserve information from both past and future at any point in the sequence.
We can try to improve the performance of the above LSTM model by tuning the hyperparameters and checking other possible refinements.
Let's test the model on a new incident ticket that is not present in our train or test datasets and see how the model predicts the assignment group for it.
ticket = ['caller confirmed that he was able to login, checked the user name in ad and reset the password']
#vectorizing the ticket with the pre-fitted tokenizer instance
ticket = tokenizer.texts_to_sequences(ticket)
#padding the ticket to exactly the shape the embedding layer expects
ticket = pad_sequences(ticket, maxlen=maxlen, value=0.0, padding='post')
print("Ticket :",ticket)
output = model.predict(ticket)
print("Output:",output)
def decode(datum):
    return np.argmax(datum)

# pd.get_dummies orders its columns by sorted label, so map the argmax
# index back through that same column order
label_columns = pd.get_dummies(tickets_corpus['Assignment group']).columns
decoded_Y = []
print("****************************************")
for i in range(output.shape[0]):
    decoded_datum = decode(output[i])
    decoded_Y.append(label_columns[decoded_datum])
print("Decoded_y:" , decoded_Y)
The model has predicted the incident ticket assignment group as GRP_0.
#saving the data to a CSV file.
file_name='preprocessed_input_data.csv'
tickets_corpus.to_csv(file_name,encoding='utf-8',index=False)
#index=False: there is no need to store the DataFrame's row indices in the file.
#(To delimit by a tab instead of a comma, pass the 'sep' argument to to_csv.)
1. LSTM Merge Mode
The Bidirectional wrapper layer also allows you to specify the merge mode, that is, how the forward and backward outputs should be combined before being passed on to the next layer.
The options are:
'sum': The outputs are added together.
'mul': The outputs are multiplied together.
'concat': The outputs are concatenated together (the default), providing double the number of outputs to the next layer.
'ave': The average of the outputs is taken.
'concat' is the default merge mode. Merge modes 'mul' and 'ave' did not show any improvement in F1 score; however, merge mode 'sum' did. Consider the following results with 17 epochs.
Fit an LSTM model with merge_mode="sum"
model = Sequential()
model.add(Embedding(input_dim=num_words,
output_dim=embedding_size,
weights=[embedding_matrix],
input_length=maxlen,
# mask_zero=True,
trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(74, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history_mode_sum = model.fit(X_train,
y_train,
epochs=17,
batch_size=batch_size,
validation_data=(X_test, y_test),
callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
plt.plot(history_mode_sum.history['accuracy'])
plt.plot(history_mode_sum.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(history_mode_sum.history['loss'])
plt.plot(history_mode_sum.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
y_pred_mode_sum = model.predict(X_test)
acc_test =model.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])
acc_train =model.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_pred_mode_sum.argmax(axis=1))))
With LSTM merge mode 'sum', test accuracy improved to 63% and training accuracy to 75%. The average F1 score of the model is 0.60. We can go ahead with 'sum'.
2. Number of LSTM Cells
We cannot know in advance the best number of memory cells for a given LSTM architecture; we must test a range of values in the LSTM hidden layer to see what works best. Let's try three different numbers of LSTM cells: 50, 100 and 200.
epochs_lstm_cells = 2
params = [50, 100, 200]
n_repeats = 2
# fit an LSTM model
def fit_model(n_cells):
    # define model
    model_lstm_cells = Sequential()
    model_lstm_cells.add(Embedding(input_dim=num_words,
                                   output_dim=embedding_size,
                                   weights=[embedding_matrix],
                                   input_length=maxlen,
                                   trainable=False))
    model_lstm_cells.add(SpatialDropout1D(0.2))
    model_lstm_cells.add(Bidirectional(LSTM(n_cells, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
    model_lstm_cells.add(Dense(100, activation='relu'))
    model_lstm_cells.add(Dropout(0.1))
    model_lstm_cells.add(Dense(74, activation='softmax'))
    # compile model (categorical cross-entropy matches the softmax output)
    model_lstm_cells.compile(loss='categorical_crossentropy', optimizer='adam')
    # fit model
    model_lstm_cells.fit(X_train,
                         y_train,
                         epochs=epochs_lstm_cells,
                         batch_size=batch_size,
                         validation_data=(X_test, y_test),
                         callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
    # evaluate model
    loss = model_lstm_cells.evaluate(X_test, y_test, verbose=0)
    return loss
# grid search parameter values
scores = DataFrame()
for value in params:
    # repeat each experiment multiple times
    loss_values = list()
    for i in range(n_repeats):
        loss = fit_model(value)
        loss_values.append(loss)
        print('>%d/%d param=%d, loss=%f' % (i+1, n_repeats, value, loss))
    # store results for this parameter
    scores[str(value)] = loss_values
# summary statistics of results
print(scores.describe())
# box and whisker plot of results
scores.boxplot()
plt.show()
By increasing the number of LSTM cells from 100 to 200 we can see a reduction in overall loss.
3. Regularization
LSTMs can quickly converge and even overfit on some sequence prediction problems. To counter this, regularization methods can be used. LSTMs support regularization such as weight regularization, which imposes pressure to decrease the size of the network weights. These can be set on the layers with the corresponding arguments.
model_regularized = Sequential()
model_regularized.add(Embedding(input_dim=num_words,
output_dim=embedding_size,
weights=[embedding_matrix],
input_length=maxlen,
trainable=False))
model_regularized.add(SpatialDropout1D(0.2))
model_regularized.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model_regularized.add(Dense(100, activation='relu', kernel_regularizer=tf.keras.regularizers.l1(0.01),
activity_regularizer=tf.keras.regularizers.l2(0.01)))
model_regularized.add(Dropout(0.1))
model_regularized.add(Dense(74, activation='softmax'))
model_regularized.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history_regularized = model_regularized.fit(X_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_data=(X_test, y_test),
callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
plt.plot(history_regularized.history['accuracy'])
plt.plot(history_regularized.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(history_regularized.history['loss'])
plt.plot(history_regularized.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
y_pred_regularized = model_regularized.predict(X_test)
acc_test_regularized =model_regularized.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test_regularized[1])
acc_train_regularized = model_regularized.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train_regularized[1])
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_pred_regularized.argmax(axis=1))))
By adding regularization to the Dense layer using kernel_regularizer and activity_regularizer, no improvement is seen on the train and validation data. The F1 score dropped from 0.60 to 0.50, possibly because of the limited data in the other categories.
4. Weight Initialization
The Keras LSTM layer uses Glorot uniform weight initialization by default, and this generally works well.
Let's try Glorot normal weight initialization and see if we can get better results.
initializer = tf.keras.initializers.GlorotNormal()
model_normalized = Sequential()
model_normalized.add(Embedding(input_dim=num_words,
output_dim=embedding_size,
weights=[embedding_matrix],
input_length=maxlen,
# mask_zero=True,
trainable=False))
model_normalized.add(SpatialDropout1D(0.2))
model_normalized.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model_normalized.add(Dense(100, activation='relu', kernel_initializer=initializer))
model_normalized.add(Dropout(0.1))
model_normalized.add(Dense(74, activation='softmax'))
model_normalized.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs_initializer_test = 10
history_normalized = model_normalized.fit(X_train,
y_train,
epochs=epochs_initializer_test,
batch_size=batch_size,
validation_data=(X_test, y_test),
callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
plt.plot(history_normalized.history['accuracy'])
plt.plot(history_normalized.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(history_normalized.history['loss'])
plt.plot(history_normalized.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
y_pred_normalized = model_normalized.predict(X_test)
acc_test_normalized = model_normalized.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test_normalized[1])
acc_train_normalized = model_normalized.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train_normalized[1])
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_pred_normalized.argmax(axis=1))))
Compared with the bidirectional LSTM model with merge_mode="sum", after adding kernel_initializer=GlorotNormal() to the Dense layer the test accuracy is almost the same at 63%, while training accuracy dropped from 75% to 68%. The F1 score is 0.57. We can prefer to use GlorotNormal.
5. Pipeline
Pipelines allow a linear sequence of data transforms to be chained together, culminating in a modelling process that can be evaluated. scikit-learn provides a Pipeline utility to help automate machine learning workflows. The goal is to ensure that every step in the pipeline is constrained to the data available for the evaluation, such as the training dataset or each fold of the cross-validation procedure.
# this calculates a vector of term frequencies
vect = CountVectorizer()
# this normalizes each term frequency
tfidf = TfidfTransformer()
#linear SVM classifier
clf = LinearSVC()
from sklearn.pipeline import Pipeline
nlp_pipeline = Pipeline([
('vect',vect),
('tfidf',tfidf),
('clf',clf)
])
#Splitting the train and test data
X_train_pip, X_test_pip, y_train_pip, y_test_pip = train_test_split(tickets_corpus['ticket_Desc_lemm'], tickets_corpus['Assignment group'], random_state = 0)
X_train_pip.shape,y_train_pip.shape,X_test_pip.shape,y_test_pip.shape
#fit train data to the pipeline
nlp_pipeline.fit(X_train_pip,y_train_pip)
# predict test instances
y_preds = nlp_pipeline.predict(X_test_pip)
# calculate f1
mean_f1 = f1_score(y_test_pip, y_preds, average='micro')
print('Mean f1 Score ---',mean_f1)
print(classification_report(y_test_pip, y_preds))
Pipeline and FeatureUnion do not by themselves improve model performance; they add value by letting us combine different rules and models, and we can define our own transformers to improve performance. Here we have built a basic pipeline model.
Pipelines help optimise the entire workflow, prevent data leakage, and keep the code simple.
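As a sketch of the custom-transformer idea (the class and the toy cleaning step are illustrative, not part of the project code), a transformer only needs `fit` and `transform` methods to slot into the same vect → tfidf → clf chain:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

class LowercaseCleaner(BaseEstimator, TransformerMixin):
    """Illustrative custom transformer: lowercases and strips each document."""
    def fit(self, X, y=None):
        return self                      # stateless: nothing to learn
    def transform(self, X):
        return [doc.lower().strip() for doc in X]

# The custom step drops straight into the same chain used above.
nlp_pipeline = Pipeline([
    ('clean', LowercaseCleaner()),
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC()),
])
```

Because the custom step lives inside the pipeline, it is applied consistently at fit time and at predict time, which is exactly the leakage protection pipelines are meant to give.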
BERT (Bidirectional Encoder Representations from Transformers).
BERT's key technical innovation is applying the bidirectional training of the Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts, which looked at a text sequence either from left to right or with combined left-to-right and right-to-left training.
Here we used BERT_MODEL = 'uncased_L-12_H-768_A-12'
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
import tensorflow_hub as hub
print("tensorflow version : ", tf.__version__)
print("tensorflow_hub version : ", hub.__version__)
print(tf.__version__)
!pip uninstall -y tensorflow==2.2.0
!pip install tensorflow==1.15.0
%cd /content/drive/My Drive/BERT/
#Install necessary pretrained models files related to BERT.
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!wget https://raw.githubusercontent.com/google-research/bert/master/modeling.py
!wget https://raw.githubusercontent.com/google-research/bert/master/optimization.py
!wget https://raw.githubusercontent.com/google-research/bert/master/run_classifier.py
!wget https://raw.githubusercontent.com/google-research/bert/master/tokenization.py
import modeling
import optimization
import run_classifier
import tokenization
#Establishing path in gdrive for BERT model zip extraction
folder = '/content/drive/My Drive/BERT/'
with zipfile.ZipFile("uncased_L-12_H-768_A-12.zip","r") as zip_ref:
    zip_ref.extractall(folder)
Create a folder for storing the model output. We have decided to use the "uncased_L-12_H-768_A-12" model. We will use the vocab.txt file in the model to map the words in the dataset to indexes. The loaded BERT model is trained on uncased/lowercase data, so the data we feed to train the model must also be lowercase, which was already handled in the milestone 1 preprocessing work.
BERT_MODEL = 'uncased_L-12_H-768_A-12'
BERT_PRETRAINED_DIR = '/content/drive/My Drive/BERT/uncased_L-12_H-768_A-12'
OUTPUT_DIR = f'{folder}/outputs'
print(f'>> Model output directory: {OUTPUT_DIR}')
print(f'>> BERT pretrained directory: {BERT_PRETRAINED_DIR}')
X=tickets_corpus["ticket_Desc_lemm"].values
le = preprocessing.LabelEncoder()
le.fit(tickets_corpus['Assignment group'].values)
y = le.transform(tickets_corpus['Assignment group'].values)
#Split the dataframe into train and test in 80:20 split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
#change path to Folder where model is located
%cd /content/drive/My Drive/BERT/uncased_L-12_H-768_A-12
Create a function for importing the dataset as per BERT's input requirements, and define the necessary hyperparameters for the model. We will try different batch sizes, learning rates and maximum sequence lengths to achieve the best possible accuracy and F1 score.
def create_examples(lines, set_type, labels=None):
    #Generate data for the BERT model
    guid = f'{set_type}'
    examples = []
    if guid == 'train':
        for line, label in zip(lines, labels):
            text_a = line
            label = str(label)
            examples.append(
                run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    else:
        for line in lines:
            text_a = line
            label = '0'
            examples.append(
                run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
# Model Hyper Parameters
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 8.0
WARMUP_PROPORTION = 0.1
MAX_SEQ_LENGTH = 128
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000 #if you wish to fine-tune a model on a larger dataset, use a larger interval
# each checkpoint weighs about 1.5 GB
ITERATIONS_PER_LOOP = 1000
NUM_TPU_CORES = 8
VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')
DO_LOWER_CASE = BERT_MODEL.startswith('uncased')
label_list = [str(num) for num in range(74)]
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)
train_examples = create_examples(X_train, 'train', labels=y_train)
tpu_cluster_resolver = None #Since training will happen on GPU, we won't need a cluster resolver
#TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator.
run_config = tf.contrib.tpu.RunConfig(
cluster=tpu_cluster_resolver,
model_dir=OUTPUT_DIR,
save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
tpu_config=tf.contrib.tpu.TPUConfig(
iterations_per_loop=ITERATIONS_PER_LOOP,
num_shards=NUM_TPU_CORES,
per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))
num_train_steps = int(
len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)
model_fn = run_classifier.model_fn_builder(
bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
num_labels=len(label_list),
init_checkpoint=INIT_CHECKPOINT,
learning_rate=LEARNING_RATE,
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
use_tpu=False, #If False training will fall on CPU or GPU, depending on what is available
use_one_hot_embeddings=True)
estimator = tf.contrib.tpu.TPUEstimator(
use_tpu=False, #If False training will fall on CPU or GPU, depending on what is available
model_fn=model_fn,
config=run_config,
train_batch_size=TRAIN_BATCH_SIZE,
eval_batch_size=EVAL_BATCH_SIZE)
Convert our train and validation features to InputFeatures that BERT understands, and create a function to train the model.
print('Please wait...')
train_features = run_classifier.convert_examples_to_features(
train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
print('>> Started training at {} '.format(datetime.datetime.now()))
print(' Num examples = {}'.format(len(train_examples)))
print(' Batch size = {}'.format(TRAIN_BATCH_SIZE))
tf.logging.info(" Num steps = %d", num_train_steps)
train_input_fn = run_classifier.input_fn_builder(
features=train_features,
seq_length=MAX_SEQ_LENGTH,
is_training=True,
drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print('>> Finished training at {}'.format(datetime.datetime.now()))
def input_fn_builder(features, seq_length, is_training, drop_remainder):
    """Creates an `input_fn` closure to be passed to TPUEstimator."""
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []
    all_label_ids = []
    for feature in features:
        all_input_ids.append(feature.input_ids)
        all_input_mask.append(feature.input_mask)
        all_segment_ids.append(feature.segment_ids)
        all_label_ids.append(feature.label_id)

    def input_fn(params):
        """The actual input function."""
        print(params)
        batch_size = 500
        num_examples = len(features)
        d = tf.data.Dataset.from_tensor_slices({
            "input_ids":
                tf.constant(
                    all_input_ids, shape=[num_examples, seq_length],
                    dtype=tf.int32),
            "input_mask":
                tf.constant(
                    all_input_mask,
                    shape=[num_examples, seq_length],
                    dtype=tf.int32),
            "segment_ids":
                tf.constant(
                    all_segment_ids,
                    shape=[num_examples, seq_length],
                    dtype=tf.int32),
            "label_ids":
                tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
        })
        if is_training:
            d = d.repeat()
            d = d.shuffle(buffer_size=100)
        d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
        return d

    return input_fn
Create the prediction input function and run the trained model on the test dataset.
predict_examples = create_examples(X_test, 'test')
predict_features = run_classifier.convert_examples_to_features(
predict_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
predict_input_fn = input_fn_builder(
features=predict_features,
seq_length=MAX_SEQ_LENGTH,
is_training=False,
drop_remainder=False)
result = estimator.predict(input_fn=predict_input_fn)
preds = []
for prediction in result:
preds.append(np.argmax(prediction['probabilities']))
print("Accuracy of BERT is:",accuracy_score(y_test,preds))
print(classification_report(y_test,preds))
print("F1-Score of the model:")
f1_score(y_test, preds, average='weighted')
The BERT model has shown better performance than the bidirectional LSTM model so far.
Many assignment groups do not have enough samples to train the classifier, and around 48% of the tickets belong to a single assignment group, GRP_0. Since the data is highly imbalanced and biased towards GRP_0, we experiment with the approaches below to make it more balanced.
Since the data is highly biased towards GRP_0, model performance on the other assignment groups is comparatively poor. To make the model learn the other groups as well, we downsample the GRP_0 tickets.
In the actual business data, however, GRP_0 tickets are the majority, and we do not want to manipulate the data in a way that distorts the real business scenario.
So we downsample GRP_0 only in the training data; the model can then learn to classify the other assignment groups, while the test data remains representative of the business process.
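As a minimal illustration of this train-only downsampling idea, here is a sketch on a tiny hypothetical frame standing in for the real ticket data (column names and counts below are made up for the example):

```python
import pandas as pd

# Tiny hypothetical frame standing in for the real ticket data
df = pd.DataFrame({
    "desc": [f"ticket {i}" for i in range(10)],
    "group": ["GRP_0"] * 6 + ["GRP_1"] * 2 + ["GRP_2"] * 2,
})

# Hold out the test split first, so it keeps the original class balance
test = df.sample(frac=0.2, random_state=1)
train = df.drop(test.index)

# Downsample the majority class only inside the training split
grp0_train = train[train["group"] == "GRP_0"].sample(n=2, random_state=1)
others_train = train[train["group"] != "GRP_0"]
train_balanced = pd.concat([grp0_train, others_train]).reset_index(drop=True)

print(train_balanced["group"].value_counts())
```

The test split is never touched, so evaluation still reflects the real, GRP_0-heavy distribution.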
#Select all assignment groups that have fewer than 30 tickets
rare_grps= tickets_corpus[tickets_corpus.groupby("Assignment group")["Assignment group"].transform('size') <30]['Assignment group'].unique()
rare_grps
#Let's check the total number of rare assignment groups
rare_grps.size
#Create a separate dataframe for the tickets belonging to the rare groups
rare_df = tickets_corpus[tickets_corpus['Assignment group'].isin(rare_grps)]
rare_df.shape
# Relabel the rare tickets with a single 'others' assignment group
rare_df = rare_df.copy()  # copy first to avoid a SettingWithCopyWarning on the filtered frame
rare_df['Assignment group'] = 'others'
#Let's check whether the group name has changed to 'others'
print(rare_df['Assignment group'].head(3))
#creating a dataframe excluding the rare groups from our original data
grp_exl_df = tickets_corpus[~tickets_corpus['Assignment group'].isin(rare_grps)]
grp_exl_df.shape
#Now add the rare-groups dataframe (labelled 'others') back to the remaining data
ticket_df = pd.concat([grp_exl_df,rare_df]).reset_index(drop=True)
ticket_df.shape
We have now clubbed the minority assignment groups into a single 'others' group. Next, let's downsample GRP_0.
#Split ticket_df into training and testing sets for the undersampling experiment
X=ticket_df['ticket_Desc_lemm'].values
y2=ticket_df['Assignment group'].values
X_train, X_test, y_train, y_test = train_test_split(X, y2, test_size=0.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
#Take only the training datasets (X_train and y_train) and convert them into a dataframe for further processing
X_col_names=["ticket_Desc_lemm"]
y_col_names=["Assignment group"]
df_X = pd.DataFrame(X_train,columns = X_col_names)
df_Y = pd.DataFrame(y_train,columns = y_col_names)
df_train = pd.concat([df_X, df_Y], axis=1)
print("Shape of df_train:",df_train.shape)
df_train.head(5)
# filter the records assigned to only GRP_0
grp0_tickets = df_train[df_train['Assignment group'] == 'GRP_0']
grp0_tickets["Assignment group"].head(5)
LDA is a popular algorithm for topic modeling, with an excellent implementation in Python's Gensim package, used to extract hidden topics from large volumes of text. It builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
Let's use gensim to implement LDA: we apply it to the GRP_0 tickets and split them into different topics.
The two main inputs needed for LDA are the dictionary (id2word) and the corpus (bag-of-words document vectors).
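A minimal stdlib sketch of these two inputs, mimicking what gensim's `corpora.Dictionary` and `doc2bow` produce (the token lists below are hypothetical stand-ins for the ticket descriptions):

```python
from collections import Counter

# Hypothetical tokenized documents standing in for the GRP_0 tickets
docs = [["password", "reset", "request"],
        ["account", "locked", "password"],
        ["vpn", "connection", "issue"]]

# Input 1: a dictionary mapping an integer id to each token
# (what gensim's corpora.Dictionary builds)
id2word = {i: w for i, w in enumerate(sorted({w for d in docs for w in d}))}
word2id = {w: i for i, w in id2word.items()}

# Input 2: the corpus as bag-of-words (token id, count) pairs per document
# (what id2word.doc2bow(text) returns in gensim)
corpus = [sorted((word2id[w], c) for w, c in Counter(d).items()) for d in docs]

print(corpus[0])  # → [(4, 1), (5, 1), (6, 1)]
```

Each document is thus reduced to sparse (id, count) pairs, which is exactly the representation the LDA model below consumes.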
# Vectorizations
def sent_to_words(sentences):
for sentence in sentences:
yield(gensim.utils.simple_preprocess(str(sentence), deacc=True)) # deacc=True removes punctuations
# Tokenize the ticket_Desc attribute of GRP_0 records
df_words = list(sent_to_words(grp0_tickets['ticket_Desc_lemm'].values.tolist()))
df_words = [[word for word in simple_preprocess(str(doc)) if word not in STOPWORDS] for doc in df_words]
# Build the bigram
bigram = gensim.models.Phrases(df_words, min_count=5, threshold=100) # higher threshold fewer phrases.
# Faster way to get a sentence clubbed as bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
df_words_bigrams = [bigram_mod[doc] for doc in df_words]
# Create Dictionary
id2word = corpora.Dictionary(df_words_bigrams)
# Term Document Frequency
#Using doc2bow, each document becomes a list of (token id, token count) pairs
corpus = [id2word.doc2bow(text) for text in df_words_bigrams]
#Let's build the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
id2word=id2word,
num_topics=3,
random_state=100,
update_every=1,
chunksize=100,
passes=10,
alpha='auto',
per_word_topics=True)
for idx, topic in lda_model.print_topics():
print('Topic: {} \nWords: {}'.format(idx+1, topic))
print()
As the output above shows, the documents are classified into 3 topics, with the highest-weighted words displayed for each topic.
#Run LDA for GRP_0
#Function to determine the topic
TOPICS = {1: "Password reset", 2:"account lock", 3:"connection issues",4:"others"}
def get_groups(text):
bow_vector = id2word.doc2bow([word for word in simple_preprocess(text) if word not in STOPWORDS])
index, score = sorted(lda_model[bow_vector][0], key=lambda tup: tup[1], reverse=True)[0]
return TOPICS[index+1 if score > 0.5 else 4], round(score, 2)
# Check a random record (index into the rows, i.e. shape[0])
text = grp0_tickets.reset_index().loc[np.random.randint(0, grp0_tickets.shape[0]),'ticket_Desc_lemm']
topic, score = get_groups(text)
print(f'Text:{text}\nTopic:{topic}\nScore:{score}')
# Apply the function to the df[ticket_Desc_lemm]
grp0_tickets.insert(loc=grp0_tickets.shape[1]-1,
column='Topic',
value=[get_groups(text)[0] for text in grp0_tickets.ticket_Desc_lemm])
grp0_tickets.head()
# Count the records based on Topics
grp0_tickets.Topic.value_counts()
X_sam= grp0_tickets.drop(['Assignment group','Topic'], axis=1)
y_sam=grp0_tickets.Topic
len(X_sam),len(y_sam)
def plot_pie(y):
""" a function to plot the pie chart showing the percentage of data in differnt topics after LDA"""
target_stats = Counter(y)
labels = list(target_stats.keys())
sizes = list(target_stats.values())
explode = tuple([0.1] * len(target_stats))
def make_autopct(values):
def my_autopct(pct):
total = sum(values)
val = int(round(pct * total / 100.0))
return '{p:.2f}% ({v:d})'.format(p=pct, v=val)
return my_autopct
fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, shadow=True,
autopct=make_autopct(sizes))
ax.axis('equal')
# Instantiate the RandomUnderSampler
sampling_strategy = 'auto'
rus = RandomUnderSampler(sampling_strategy=sampling_strategy, random_state=0)
# Fit the data
X_res, y_res = rus.fit_resample(X_sam,y_sam)
print('Information of the data set after making it '
'balanced by under-sampling: \n sampling_strategy={} \n y: {}'
.format(sampling_strategy, Counter(y_res)))
plot_pie(y_res)
#Convert the resampled output arrays to dataframes for further processing
col_names = X_sam.columns
X_res = pd.DataFrame(X_res,columns = col_names)
y_res = pd.DataFrame(y_res,columns = ['Topic'])
type(y_res),type(X_res)
# Combine Topic and Assignment Group columns
grp0_df = pd.concat([X_res, y_res], axis=1)
grp0_df.shape
grp0_df["Assignment group"] = 'GRP_0'
grp0_df.drop(['Topic'], axis=1, inplace=True)
print(grp0_df.columns)
print(grp0_df['Assignment group'].head())
print("Total size of GRP_0 tickets after LDA:",grp0_df.shape)
#Create a dataframe excluding the GRP_0 tickets
df_excl_grp0 = df_train[df_train['Assignment group'] != 'GRP_0']
# Join the undersampled GRP_0 dataset to the excluded dataset
df = pd.concat([grp0_df, df_excl_grp0]).reset_index(drop=True)
df.shape
print(df.columns)
df[df["Assignment group"] == 'GRP_0'].count()
print('Unique groups remaining:', df['Assignment group'].nunique())
plt.figure(figsize=(20,12))
sns.set_style("whitegrid")
sns.countplot(df['Assignment group'])
plt.xticks(rotation=90)
plt.xlabel("Assignment groups")
plt.ylabel("Count")
plt.title("Frquency of Assignment Groups after undersampling and clubbing",fontsize=18)
#Creating the training datasets after undersampling
X_train = df["ticket_Desc_lemm"]
y_train = df["Assignment group"]
X_train.shape,y_train.shape
#Tokenize the data for the model
X_train=tokenizer.texts_to_sequences(df['ticket_Desc_lemm'])
X_train = pad_sequences(X_train, padding='post',maxlen = maxlen)
X_test=tokenizer.texts_to_sequences(ticket_df['ticket_Desc_lemm'])
X_test = pad_sequences(X_test, padding='post',maxlen = maxlen)
y_train = pd.get_dummies(df['Assignment group']).values
y_test= pd.get_dummies(ticket_df['Assignment group']).values
X_train.shape,y_train.shape,X_test.shape,y_test.shape
#Bidirectional model with merge_mode="sum" and kernel_initializer as 'GlorotNormal()'
model = Sequential()
model.add(Embedding(input_dim=num_words,
output_dim=embedding_size,
weights=[embedding_matrix],
input_length=maxlen,
mask_zero=True,
trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model.add(Dense(100, activation='relu', kernel_initializer=initializer))
model.add(Dropout(0.1))
model.add(Dense(36, activation='softmax'))
model.summary()
#Configure the model.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history_B = model.fit(X_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_data=(X_test, y_test),
callbacks=[modelcheckpoint,EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)])
model.load_weights(output_dir+"/weights.19.hdf5") # load the best saved weights
acc_test =model.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])
acc_train =model.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
y_predB = model.predict(X_test)
groups = ticket_df['Assignment group'].unique()
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_predB.argmax(axis=1),target_names=groups)))
Plot the accuracy of the classifier
plt.plot(history_B.history['accuracy'])
plt.plot(history_B.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
Plot the Loss of the classifier
plt.plot(history_B.history['loss'])
plt.plot(history_B.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
#ROC-AUC Score of the Model:
"{:0.2f}".format(roc_auc_score(y_test,y_predB)*100.0)
Let's tune this model further!
#Bidirectional model with 150 LSTM neurons , merge_mode="sum" and kernel_initializer as 'GlorotNormal()'
model_chgNeur = Sequential()
model_chgNeur.add(Embedding(input_dim=num_words,
output_dim=embedding_size,
weights=[embedding_matrix],
input_length=maxlen,
# mask_zero=True,
trainable=False))
model_chgNeur.add(SpatialDropout1D(0.2))
model_chgNeur.add(Bidirectional(LSTM(150, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model_chgNeur.add(Dense(100, activation='relu', kernel_initializer=initializer))
model_chgNeur.add(Dropout(0.1))
model_chgNeur.add(Dense(36, activation='softmax'))
#Configure the model.
model_chgNeur.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history_B1 = model_chgNeur.fit(X_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_data=(X_test, y_test),
callbacks=[modelcheckpoint,EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)])
acc_test =model_chgNeur.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])
acc_train =model_chgNeur.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
y_predB1 = model_chgNeur.predict(X_test)
groups = ticket_df['Assignment group'].unique()
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_predB1.argmax(axis=1),target_names=groups)))
Plot the accuracy of the classifier
plt.plot(history_B1.history['accuracy'])
plt.plot(history_B1.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
Plot Loss of the classifier
plt.plot(history_B1.history['loss'])
plt.plot(history_B1.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
#ROC-AUC Score of the Model:
"{:0.2f}".format(roc_auc_score(y_test,y_predB1)*100.0)
maxlen= 150
#Re-tokenize and pad the training and testing datasets with the new maxlen
X_train=tokenizer.texts_to_sequences(df['ticket_Desc_lemm'])
X_train = pad_sequences(X_train, padding='post',maxlen = maxlen)
X_test=tokenizer.texts_to_sequences(ticket_df['ticket_Desc_lemm'])
X_test = pad_sequences(X_test, padding='post',maxlen = maxlen)
y_train = pd.get_dummies(df['Assignment group']).values
y_test= pd.get_dummies(ticket_df['Assignment group']).values
#Bidirectional model with maxlen = 150 ,merge_mode="sum" and kernel_initializer as 'GlorotNormal()'
model_chgLen = Sequential()
model_chgLen.add(Embedding(input_dim=num_words,
output_dim=embedding_size,
weights=[embedding_matrix],
input_length=maxlen,
mask_zero=True,
trainable=False))
model_chgLen.add(SpatialDropout1D(0.2))
model_chgLen.add(Bidirectional(LSTM(150, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model_chgLen.add(Dense(100, activation='relu', kernel_initializer=initializer))
model_chgLen.add(Dropout(0.1))
model_chgLen.add(Dense(36, activation='softmax'))
#Configure the model.
model_chgLen.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#Run the model
history_B2 = model_chgLen.fit(X_train,
y_train,
epochs=epochs,
batch_size=batch_size,
validation_data=(X_test, y_test),
callbacks=[modelcheckpoint,EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)])
acc_test =model_chgLen.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])
acc_train =model_chgLen.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
y_predB2 = model_chgLen.predict(X_test)
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_predB2.argmax(axis=1),target_names=groups)))
Plot Accuracy of the classifier
plt.plot(history_B2.history['accuracy'])
plt.plot(history_B2.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
Plot Loss of the Classifier
plt.plot(history_B2.history['loss'])
plt.plot(history_B2.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
#ROC-AUC Score of the Model:
"{:0.2f}".format(roc_auc_score(y_test,y_predB2)*100.0)
In this project, a model based on supervised machine learning algorithms is proposed to assign tickets automatically. A preprocessed dataset of previously categorized tickets is used to train the classification algorithms.
We implemented different classification algorithms to compare their performance, and tuned the models with different hyperparameters for better results.
| Model Tuning Steps | F1 Score |
|---|---|
| Bidirectional LSTM [100 LSTM neurons, maxlen=300] | 0.58 |
| Bidirectional LSTM [100 LSTM neurons, maxlen=300, merge_mode="sum"] | 0.60 |
| Bidirectional LSTM [100 LSTM neurons, maxlen=300, merge_mode="sum", L1/L2 regularizer in the dense layer] | 0.50 |
| Bidirectional LSTM [100 LSTM neurons, merge_mode="sum", kernel_initializer=GlorotNormal() in the dense layer] | 0.57 |
State of the Art NLP Model:
| Model | F1 Score |
|---|---|
| BERT[Uncased: 12-layer, 768-hidden, 12-heads] | 0.64 |
After clubbing minority groups and undersampling GRP_0 in the training set:
| Model Tuning Steps | F1 Score | ROC-AUC Score |
|---|---|---|
| Bidirectional LSTM [100 LSTM neurons, maxlen=300, merge_mode="sum", kernel_initializer=GlorotNormal() in the dense layer] | 0.69 | 97.72 |
| Bidirectional LSTM [150 LSTM neurons, maxlen=300, merge_mode="sum", kernel_initializer=GlorotNormal() in the dense layer] | 0.73 | 97.91 |
| Bidirectional LSTM [150 LSTM neurons, maxlen=150, merge_mode="sum", kernel_initializer=GlorotNormal() in the dense layer] | 0.73 | 97.91 |
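For reference, the F1 scores reported above are support-weighted averages over the assignment groups. A small sketch with hypothetical predictions shows how such a score is computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions for a 3-class problem
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 0])

# 'weighted' averages the per-class F1 scores by class support, so large
# groups such as GRP_0 dominate the reported number
print(round(f1_score(y_true, y_pred, average="weighted"), 2))  # → 0.75
```

This is why downsampling GRP_0 in the training set can raise the weighted F1: the model improves on the many smaller groups without the score being swamped by one class.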